Enforcing Predictability of Many-cores with DCFNoC
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] The ever-growing need for higher performance forces industry to include technology based on multi-processor systems-on-chip (MPSoCs) in their safety-critical embedded systems. MPSoCs include a network-on-chip (NoC) to interconnect the cores with each other, with memory, and with the rest of the shared resources. Unfortunately, the inclusion of NoCs compromises time predictability, as network-level conflicts may occur. To overcome this problem, in this paper we propose DCFNoC, a new time-predictable NoC design paradigm where conflicts within the network are eliminated by design. This new paradigm builds on top of the channel dependency graph (CDG) in order to deterministically avoid network conflicts. The network guarantees predictability to applications and is able to naturally inject messages using a TDM period equal to the optimal theoretical bound, without the need for a computationally demanding offline process. DCFNoC is integrated in a tile-based many-core system and adapted to its memory hierarchy. Our results show that DCFNoC guarantees time predictability, avoiding network interference among multiple running applications. DCFNoC always guarantees performance and also improves wormhole performance in a 4 × 4 setting by a factor of 3.7× when interference traffic is injected. For an 8 × 8 network, differences are even larger.
In addition, DCFNoC obtains a total area saving of 10.79% over a standard wormhole implementation. This work has been supported by MINECO under Grant BES-2016-076885, by MINECO and funds from the European ERDF under Grant TIN2015-66972-C05-1-R and Grant RTI2018-098156-B-C51, and by the EC H2020 RECIPE project under Grant 801137. Picornell-Sanjuan, T.; Flich Cardo, J.; Hernández Luz, C.; Duato Marín, JF. (2021). Enforcing Predictability of Many-cores with DCFNoC. IEEE Transactions on Computers. 70(2):270-283. https://doi.org/10.1109/TC.2020.2987797
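The conflict-free TDM injection described in the abstract can be pictured with a minimal sketch. The slot assignment below is purely illustrative (DCFNoC derives its actual schedule from the channel dependency graph); it only shows the core invariant that each node injects once per TDM period, so messages never compete for a link:

```python
# Minimal sketch of TDM-based conflict-free injection on a 2D mesh.
# Each node owns one slot in a fixed period and may inject only during
# that slot, so no two messages contend for a link at the same step.
# The row-major slot assignment is an illustrative assumption, not
# DCFNoC's actual CDG-derived schedule.

def tdm_slot(x: int, y: int, width: int) -> int:
    """Illustrative slot: nodes take turns in row-major order."""
    return y * width + x

def may_inject(x: int, y: int, width: int, height: int, cycle: int) -> bool:
    period = width * height  # TDM period = number of nodes
    return cycle % period == tdm_slot(x, y, width)

# In a 4x4 mesh, node (1, 0) owns slot 1 and injects on cycles 1, 17, 33, ...
assert may_inject(1, 0, 4, 4, 1)
assert not may_inject(1, 0, 4, 4, 2)
assert may_inject(1, 0, 4, 4, 17)
```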
Efficient and scalable starvation prevention mechanism for token coherence
[EN] Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approaches to coherence: direct communication between processors (like snooping-based protocols) and no reliance on bus-like interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving cache misses. The problem with unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, moreover, endangers the scalability of Token Coherence, since it requires storage structures (at each node) whose size grows proportionally to the system size. As multiprocessors continue to include an increasing number of nodes, both the performance and the scalability of cache coherence protocols will remain key aspects. In this work, we propose an alternative starvation prevention mechanism, named priority requests, that outperforms persistent requests. This mechanism is able to reduce application runtime by more than 20 percent (on average) in a 64-processor system. Furthermore, thanks to the flexibility of priority requests, it is possible to drastically reduce their storage requirements, thereby improving the overall scalability of Token Coherence. Although this is achieved at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly. This work was partially supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01. Antonio Robles is taking a sabbatical granted by the Universidad Politécnica de Valencia for updating his teaching and research activities. Cuesta Sáez, BA.; Robles Martínez, A.; Duato Marín, JF. (2011). Efficient and scalable starvation prevention mechanism for token coherence. IEEE Transactions on Parallel and Distributed Systems. 22(10):1610-1623. doi:10.1109/TPDS.2011.30
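The starvation-prevention idea can be sketched as age-based arbitration: a starving node issues a prioritized retry stamped with its age, and the oldest outstanding request wins. The data structures below are illustrative assumptions, not the protocol's actual hardware state:

```python
# Sketch of age-based starvation prevention: starving nodes issue
# timestamped priority requests, and the oldest one is always served
# first, guaranteeing every miss eventually completes. A min-heap is
# an illustrative stand-in for the protocol's arbitration hardware.
import heapq

pending = []  # min-heap of (timestamp, node) priority requests

def issue_priority_request(timestamp: int, node: int) -> None:
    heapq.heappush(pending, (timestamp, node))

def next_to_serve():
    """Oldest outstanding priority request wins the arbitration."""
    return heapq.heappop(pending)[1] if pending else None

issue_priority_request(12, 3)
issue_priority_request(7, 1)   # older request, issued earlier by node 1
assert next_to_serve() == 1    # the longest-starving node is served first
assert next_to_serve() == 3
```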
HP-DCFNoC: High Performance Distributed Dynamic TDM Scheduler Based on DCFNoC Theory
(c) 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.[EN] The need to increase the performance of critical real-time embedded systems pushes the industry to adopt complex multi-core processor designs with embedded networks-on-chip. In this paper we present hp-DCFNoC, a distributed dynamic scheduler design that, by relying on the key properties of a delayed conflict-free NoC (DCFNoC), is able to achieve peak performance numbers very close to those of a wormhole-based NoC design without compromising its real-time guarantees. In particular, our results show that the proposed scheduler achieves an overall throughput improvement of 6.9x and 14.4x over a baseline DCFNoC for 16- and 64-node meshes, respectively. When compared against a standard wormhole router, 95% of its network throughput is preserved while strict timing predictability is kept as a property. This achievement opens the door to new high-performance time-predictable NoC designs. This work was supported in part by the Secretaría de Estado de Investigación, Desarrollo e Innovación (MINECO) under Grant BES-2016-076885, in part by the European Regional Development Fund (ERDF) under Grant TIN2015-66972-C05-1-R and Grant RTI2018-098156-B-C51, and in part by the EC H2020 European Institute of Innovation and Technology (SELENE) Project under Grant 871467. Picornell-Sanjuan, T.; Flich Cardo, J.; Duato Marín, JF.; Hernández Luz, C. (2020). HP-DCFNoC: High Performance Distributed Dynamic TDM Scheduler Based on DCFNoC Theory. IEEE Access. 8:194836-194849. https://doi.org/10.1109/ACCESS.2020.3033853
Accurately modeling the on-chip and off-chip GPU memory subsystem
[EN] Research on GPU architecture is becoming pervasive in both academia and industry because these architectures offer much more performance per watt than typical CPU architectures. This is the main reason why massive deployment of GPU multiprocessors is considered one of the most feasible solutions to attain exascale computing capabilities.
The memory hierarchy of the GPU is a critical research topic, since its design goals widely differ from those of conventional CPU memory hierarchies. Researchers typically use detailed microarchitectural simulators to explore novel designs to better support GPGPU computing as well as to improve the performance of GPU and CPU-GPU systems. In this context, the memory hierarchy is a critical and continuously evolving subsystem.
Unfortunately, the fast evolution of current memory subsystems deteriorates the accuracy of existing state-of-the-art simulators. This paper focuses on accurately modeling the entire (both on-chip and off-chip) GPU memory subsystem. For this purpose, we identify four main memory-related components that impact the overall accuracy of the modeled performance. Three of them belong to the on-chip memory hierarchy: (i) memory request coalescing mechanisms, (ii) miss status holding registers, and (iii) the cache coherence protocol; the fourth component refers to the memory controller and GDDR memory working activity.
To evaluate and quantify our claims, we accurately modeled the aforementioned memory components in an extended version of the state-of-the-art Multi2Sim heterogeneous CPU-GPU processor simulator. Experimental results show important deviations, which can vary the final system performance provided by the simulation framework by up to a factor of three. The proposed GPU model has been compared and validated against the original framework and the results from a real AMD Southern Islands 7870HD GPU. (C) 2017 Elsevier B.V. All rights reserved. This work was supported in part by Generalitat Valenciana under grant AICO/2016/059, by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds under Grant TIN2015-66972-C5-1-R, and by Programa de Ayudas de Investigación y Desarrollo (PAID) de la Universitat Politècnica de València. Candel-Margaix, F.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2018). Accurately modeling the on-chip and off-chip GPU memory subsystem. Future Generation Computer Systems. 82:510-519. https://doi.org/10.1016/j.future.2017.02.012
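The first of the four modeled components, memory request coalescing, can be sketched in a few lines: per-thread addresses issued by a warp in the same cycle are merged into unique cache-line requests. The line size and function names are illustrative assumptions, not Multi2Sim's actual implementation:

```python
# Sketch of a GPU memory request coalescer: the per-thread addresses of
# a warp are collapsed into the distinct cache lines they touch, so one
# line request can serve many threads. LINE_SIZE is an assumption.

LINE_SIZE = 128  # bytes per cache line (illustrative)

def coalesce(addresses):
    """Return the distinct cache-line base addresses for a warp's accesses."""
    return sorted({addr // LINE_SIZE * LINE_SIZE for addr in addresses})

# 32 threads reading consecutive 4-byte words collapse into one request:
warp = [0x1000 + 4 * t for t in range(32)]
assert coalesce(warp) == [0x1000]

# Line-strided accesses generate one request per line touched:
strided = [0x1000 + LINE_SIZE * t for t in range(4)]
assert len(coalesce(strided)) == 4
```

Mis-modeling this step inflates the apparent memory traffic, which is one way simulator accuracy degrades as real coalescers evolve.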
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance processor performance, but their proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched to the same core. The relation between the L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core. Based on these findings, we propose two L1-bandwidth-aware thread-to-core (t2c) allocation policies, namely Static and Dynamic t2c allocation. The aim of these policies is to properly balance the L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve performance over the Linux OS kernel, regardless of the number of cores considered. This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01; and by Programa de Apoyo a la Investigación y Desarrollo (PAID-05-12) of the Universitat Politècnica de València under Grant SP20120748. Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. IEEE. https://doi.org/10.1109/PACT.2013.6618810
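The balancing idea behind the static policy can be sketched as a simple pairing heuristic: co-schedule the most L1-bandwidth-hungry thread with the least hungry one so per-core demand evens out. The benchmark names, numbers, and the lightest-with-heaviest heuristic are illustrative assumptions, not the paper's exact algorithm:

```python
# Sketch of a static L1-bandwidth-aware thread-to-core pairing: rank
# threads by measured L1 bandwidth and pair the lightest with the
# heaviest, so each SMT core receives a balanced total demand.
# The demand figures below are illustrative, not measured values.

def pair_by_bandwidth(bw):
    """bw: {thread: L1 transactions/cycle}. Returns balanced core pairs."""
    ranked = sorted(bw, key=bw.get)  # ascending bandwidth demand
    # Pair lightest with heaviest, second lightest with second heaviest...
    return [(ranked[i], ranked[-1 - i]) for i in range(len(ranked) // 2)]

demand = {"mcf": 0.9, "milc": 0.8, "namd": 0.2, "povray": 0.1}
pairs = pair_by_bandwidth(demand)
# Each SMT core gets one high- and one low-bandwidth co-runner.
assert sorted(pairs) == [("namd", "milc"), ("povray", "mcf")]
```

The dynamic policy would re-run this pairing periodically from fresh performance-counter readings instead of one offline profile.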
Cache-Hierarchy contention-aware scheduling in CMPs
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. To improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increases with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., a shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be co-scheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated on a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, performance improvements of 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is 3.61 percent for a state-of-the-art memory contention-aware scheduler under the evaluated mixes. This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01, and by the Universitat Politècnica de València under Grant PAID-05-12 SP20120748. Feliu Pérez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2014). Cache-Hierarchy contention-aware scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems. 25(3):581-590. https://doi.org/10.1109/TPDS.2013.61
Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches
© Owner/Author 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 27th ACM International Conference on Supercomputing (ICS '13); http://dx.doi.org/10.1145/2464996.2467278. This work introduces a novel refresh mechanism that leverages reuse information to decide which blocks should be refreshed in an energy-aware eDRAM last-level cache. Experimental results show that, compared to a conventional eDRAM cache, the energy-aware approach achieves refresh energy savings of up to 71%, while overall dynamic energy is reduced by 65% with negligible performance losses. This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grants TIN-2009-14475-C04-01 and TIN2012-38341-C04-01. Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches. ACM. https://doi.org/10.1145/2464996.2467278
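The reuse-aware refresh idea admits a compact sketch: blocks that have shown reuse are refreshed, while the rest are allowed to decay and are simply invalidated, saving refresh energy. The field names and per-pass structure are illustrative assumptions, not the paper's exact hardware layout:

```python
# Sketch of reuse-aware eDRAM refresh: only blocks whose reuse bit is
# set are refreshed; never-reused blocks are invalidated and allowed to
# decay, so no refresh energy is spent on them. Field names are
# illustrative, not the actual cache-line format.

class Block:
    def __init__(self):
        self.valid = False
        self.reused = False  # set on a second access to the block

def refresh_pass(cache):
    """Refresh only reused blocks; let the rest decay. Returns refresh count."""
    refreshed = 0
    for blk in cache:
        if blk.valid and blk.reused:
            refreshed += 1       # pay refresh energy for useful data
        else:
            blk.valid = False    # decays unrefreshed; no energy spent
    return refreshed

cache = [Block() for _ in range(4)]
cache[0].valid = cache[0].reused = True
cache[1].valid = True            # touched once, never reused
assert refresh_pass(cache) == 1
assert not cache[1].valid        # dropped instead of refreshed
```

The savings follow directly: every block that fails the reuse test is a refresh operation avoided, at the cost of an occasional extra miss on a block that would have been reused later.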
Addressing fairness in SMT multicores with a progress-aware scheduler
© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last-level cache) and DRAM modules. Per-process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different paces. If schedulers are not progress-aware, the unpredictable execution times caused by unfairness can introduce undesirable behaviors in the system, such as difficulties in maintaining priority-based scheduling.
This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per process progress. Then, those processes with less progress are prioritized to enhance fairness.
Experimental results on an Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over the Linux OS. Moreover, thanks to the thread-to-core allocation policy, the scheduler slightly improves throughput and turnaround time. This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award. Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48
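The progress-accounting step described above can be sketched simply: progress is the ratio of the IPC a process achieves under contention to the standalone IPC estimated during the low-contention co-schedules, and the least-progressed processes are prioritized for the next quantum. The numbers and function names are illustrative:

```python
# Sketch of progress-aware prioritization: each process's progress is
# its shared-execution IPC divided by its estimated standalone IPC,
# and processes with the least progress are scheduled first to restore
# fairness. IPC values below are illustrative, not measurements.

def progress(ipc_shared: float, ipc_alone: float) -> float:
    return ipc_shared / ipc_alone

def prioritize(procs):
    """procs: {pid: (ipc_shared, ipc_alone)}. Least progress runs first."""
    return sorted(procs, key=lambda p: progress(*procs[p]))

procs = {"A": (0.8, 1.0), "B": (0.5, 2.0), "C": (1.2, 1.4)}
# B has progressed least (25% of standalone pace), so it runs first.
assert prioritize(procs) == ["B", "A", "C"]
```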
Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores
[EN] Nowadays, high-performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within each core. While resource sharing increases resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role in dealing with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue, since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is reduced to a half while still improving performance by 5.6 percent. We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246EXP, and by the Intel Early Career Faculty Honor Program Award. Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977
Constructing virtual 5-dimensional tori out of lower-dimensional network cards
[EN] In the Top500 and Graph500 lists of recent years, some of the most powerful systems implement a torus topology to interconnect the millions of computing nodes they include. Some of these torus networks are of five or six dimensions, which implies an additional difficulty as the node degree increases. In previous works, we proposed and evaluated the nD Twin (nDT) torus topology to virtually increase the number of dimensions a torus is able to implement. We showed that this new topology reduces the distances between nodes, therefore increasing global network performance. In this work, we present how to build a 5DT torus network using a specific commercial 6-port network card (the EXTOLL card) to interconnect the nodes. We show, using the same number of cards, that the performance of the 5DT torus network we are able to implement using our proposal is higher than that of a 3D torus network with the same number of compute nodes. Spanish MINECO; European Commission, Grant/Award Number: TIN2015-66972-C5-1-R and TIN2015-66972-C5-2-R; JCCM, Grant/Award Number: PEII-2014-028-P; Spanish MICINN, Grant/Award Number: FJCI-2015-26080. Andújar-Muñoz, FJ.; Villar, JA.; Sanchez Garcia, JL.; Alfaro Cortes, FJ.; Duato Marín, JF.; Fröning, H. (2017). Constructing virtual 5-dimensional tori out of lower-dimensional network cards. Concurrency and Computation: Practice and Experience. 1-17. https://doi.org/10.1002/cpe.4361
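Why extra dimensions shrink distances can be checked with a small computation: in a k-ary torus dimension the hop distance is min(d, k - d), so spreading the same node count over more, shorter dimensions lowers the average path length. The topology sizes below are illustrative, not the paper's exact configurations:

```python
# Sketch of torus path-length arithmetic: per-dimension hop distance is
# min(d, k - d) thanks to wraparound links, and average distance over a
# vertex-transitive torus equals the average distance from the origin.
from itertools import product

def torus_distance(a, b, dims):
    return sum(min(abs(x - y), k - abs(x - y))
               for x, y, k in zip(a, b, dims))

def avg_distance(dims):
    nodes = list(product(*(range(k) for k in dims)))
    origin = tuple(0 for _ in dims)
    return sum(torus_distance(origin, n, dims) for n in nodes) / len(nodes)

# 512 nodes: a 3D 8x8x8 torus vs. a (virtual) 5D 4x4x4x4x2 torus.
assert avg_distance((8, 8, 8)) == 6.0
assert avg_distance((4, 4, 4, 4, 2)) == 4.5  # shorter average paths
```

The same node count reorganized into five shorter dimensions cuts the average hop count by 25% in this example, which is the distance reduction the nDT construction exploits.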